Day 15: From Crawler to Website - Crawling Yearly Stats

12th iThome Ironman Contest


Self-Challenge Group

From Crawler to Website series, part 16


jeff3071

2020-09-17 22:58:44


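The project started with the crawler, and by this point the amount of code is growing, so the naming needs a consistent format too. For example, if the recent-stats crawler is called recent_stat_crawler but the monthly one is called month_crawler, making changes quickly becomes confusing.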

I unified the names before writing these posts, so the code in the earlier posts already uses the unified names.

crawl.py

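import asyncio

import requests
from bs4 import BeautifulSoup

from db_connect import store_year

# Shared state; assumed to be defined at module level in the original crawl.py.
loop = asyncio.new_event_loop()
tasks = []
year_result = []

COLS = 31  # each player row has 31 <td> cells


async def send_year_req(url):
    # requests.get is blocking, so run it in the default thread pool executor
    res = await loop.run_in_executor(None, requests.get, url)
    soup = BeautifulSoup(res.text, 'lxml')
    data_type = soup.select('th')    # stat names (table headers)
    player_stat = soup.select('td')  # flat list of every cell on the page
    for i in range(len(player_stat) // COLS):
        player_dict = {}
        for x in range(COLS):
            player_dict[data_type[x].text.strip()] = player_stat[i * COLS + x].text.strip()
        year_result.append(player_dict)


def get_player_year_stat():
    url = "http://www.cpbl.com.tw/stats/all.html?&game_type=01&&stat=pbat&year=2020&online=1&per_page="
    for i in range(5):  # result pages 1 to 5
        full_url = url + str(i + 1)
        task = loop.create_task(send_year_req(full_url))
        tasks.append(task)
    loop.run_until_complete(asyncio.wait(tasks))
    store_year(year_result)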

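This time I dropped the clumsy earlier approach of declaring a separate variable for every stat. The crawler first reads which stat types exist (the th headers), then stores each stat's value in a dict keyed by the stat name. The number 31 is the row stride: the values are selected by td, and each player has 31 cells, from jersey number and name through batting average and so on.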
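To make the stride idea concrete, here is a minimal sketch that works without any network access. It assumes the header list has exactly one entry per column; the function name rows_from_cells and the sample values are made up for illustration, and only the NAME key comes from the code in this post.

```python
def rows_from_cells(headers, cells):
    """Group a flat list of cell texts into one dict per row."""
    width = len(headers)  # 31 on the real page, 3 in this toy example
    rows = []
    # Integer division drops any incomplete trailing row,
    # just like len(player_stat) // 31 in the crawler.
    for i in range(len(cells) // width):
        chunk = cells[i * width:(i + 1) * width]
        rows.append(dict(zip(headers, chunk)))
    return rows


# Hypothetical sample data standing in for the th and td texts.
headers = ['NUM', 'NAME', 'AVG']
cells = ['1', 'Player A', '0.310',
         '2', 'Player B', '0.250',
         'stray']  # leftover cell that does not form a full row

rows = rows_from_cells(headers, cells)
print(rows)
```

Deriving the width from the headers rather than hard-coding it would keep the loop working if the site adds or removes a column, as long as the header count matches the cells per row.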

db_connect.py

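from firebase_admin import firestore

# Assumes firebase_admin.initialize_app(...) has already been called elsewhere.


def store_year(data):
    db = firestore.client()
    batch = db.batch()
    for player in data:
        # Collection '打者' = batters; one document per player name.
        doc_ref = db.collection('打者').document(str(player['NAME']))
        del player['NAME']
        doc = doc_ref.get()
        # '年度' = yearly stats. .get() avoids a KeyError when the field is missing.
        if doc.exists and doc.to_dict().get('年度'):
            batch.update(doc_ref, {'年度': player})
        else:
            # Queue the write in the same batch so everything commits together.
            batch.set(doc_ref, {'年度': player}, merge=True)
    batch.commit()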

Storing the data in the database works the same way as before.
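One way to check the update-or-create branch without a Firestore connection is to separate the decision from the write. The sketch below is only an illustration under that assumption: plan_year_writes, the operation labels, and the sample players are hypothetical, and a plain dict stands in for the documents already in the database.

```python
def plan_year_writes(existing, players):
    """Return (operation, doc_id, payload) tuples without touching a database."""
    plan = []
    for player in players:
        stats = dict(player)  # copy so the caller's dict is not mutated
        name = str(stats.pop('NAME'))
        current = existing.get(name)
        if current is not None and current.get('年度'):
            # Document already has non-empty yearly stats: overwrite the field.
            plan.append(('update', name, {'年度': stats}))
        else:
            # Missing document or empty field: create or merge.
            plan.append(('set_merge', name, {'年度': stats}))
    return plan


# Hypothetical stand-in for documents already stored.
existing = {'A': {'年度': {'AVG': '0.300'}}, 'B': {}}
players = [
    {'NAME': 'A', 'AVG': '0.310'},
    {'NAME': 'B', 'AVG': '0.250'},
    {'NAME': 'C', 'AVG': '0.280'},
]

plan = plan_year_writes(existing, players)
for operation, doc_id, payload in plan:
    print(operation, doc_id, payload)
```

Each tuple could then be mapped onto the matching batch call in store_year.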

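Now for some miscellaneous notes. I started planning and building this site in August, and the posts have now roughly caught up with my actual progress. Most of that time went into optimizing the crawler, moving from multithreading to asynchronous requests. For the charts I originally planned to use pyecharts, but a senior classmate later suggested using ECharts directly, so I may write about that in the last few days of the series. The next post will continue to flesh out this feature, and the next feature I hope to build is a line chart of team win rates.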